COMPUTER METHOD AND APPARATUS FOR ASYNCHRONOUS ORDERED OPERATIONS

This application is a continuation of application Ser. No. 08/280,370 filed Jul. 26, 1994 and now abandoned.

BACKGROUND OF THE INVENTION

The present invention relates to computer systems and to file system implementations for computer operating systems and methods and apparatus used by file systems for controlling the order of operations, such as the order in which information is updated on secondary storage, to realize gains in performance.

Computer systems are composed of hardware and software. The hardware includes one or more processors, typically a central processing unit (CPU), main storage or memory, secondary storage, and other input/output (I/O) devices. The software includes an operating system and user (application) programs. The computer system executes user programs in the hardware under the control of the operating system. The operating system controls the operation of secondary storage devices and other I/O devices such as terminals through a set of software modules called device drivers.

In modern computer systems, secondary storage systems such as disks have become performance bottlenecks because processors have higher speeds than disks. Various methods have been used to minimize the impact of disk subsystems on overall system performance. For example, some disk controllers employ large random access memories as disk caches in order to reduce the number of slower disk accesses. Operating system device drivers use a variety of algorithms to schedule disk requests so that they can be serviced with minimum mechanical movement or delays within the disk hardware. Some file system implementations log their operations so that it is not critical to have all intermediate information updates applied immediately to secondary storage. See, for example, Mendel Rosenblum and John K. Ousterhout, "The Design and Implementation of a Log Structured File System," Proceedings of the 13th ACM Symposium on Operating System Principles (October 1991), and Robert Hagmann, "Reimplementing the Cedar File System using Logging and Group Commit," Proceedings of the 11th ACM Symposium on Operating Systems Principles (November 1987).

By way of background, three types of writes exist for writing information to disk storage, namely, synchronous, asynchronous, and delayed writes. With a synchronous write, the computer system suspends execution of the program that caused the write to occur. When the write completes, the program is allowed to continue. With an asynchronous write, the computer system permits the program to continue after enqueuing the request for writing with the device drivers that manage the operation of disks. In this case, the program can make further progress, even though the actual information to be written is not yet stored to disk. Delayed writing is a special type of asynchronous write, in which the execution of the program is allowed to continue without enqueuing the write request with the device drivers. In this case, the buffer in memory that is modified during the write is marked as needing to be written to disk, and the request is propagated to the device drivers by the operating system at a later time. Generally, the operating system ensures that the request propagates within a finite time interval. Asynchronous writes achieve a performance advantage over synchronous writes by decoupling the execution of processors from disk subsystems and allowing more overlap between them. Delayed writes improve the decoupling and serve to reduce the aggregate number of disk writes by allowing multiple modifications of the same buffer to be propagated to the disk with a single disk write.

Despite the performance advantage of using asynchronous and delayed writes over synchronous writes as described above, many file system implementations employ synchronous write operations for recording changes to file system structural (administrative) data. Synchronous writing is used so that the file system implementation can regulate the order in which structural changes appear on the disk. By controlling the order in which modifications of structural data are written to disk, a file system implementation achieves the capability to perform file system repairs in the event that a system crash occurs before a sequence of structural changes can complete and reach a self-consistent organization of file system structural information. The specific requirements for ordering updates of structural data vary according to file system implementation as described, for example, in M. Bach, "The Design of the UNIX Operating System," Prentice-Hall, Englewood Cliffs, 1986. An example of a utility for repairing file systems following a crash, the fsck program, is described in M. McKusick, W. Joy, S. Leffler, and R. Fabry, "Fsck - The UNIX File System Check Program," UNIX System Manager's Manual - 4.3 BSD Virtual VAX-11 Version, USENIX, 1986.

As described above, many file system implementations need to perform ordered disk writing for maintaining structural order and repairability, and therefore they employ synchronous writes that maintain the order of disk writes. The use of synchronous writes, however, limits system performance since disks and other secondary storage devices are slower relative to processors and main memory. File system formats can be designed to minimize the number of distinct disk updates needed for accomplishing a consistent reorganization of structure. Alternative techniques for repairability, such as logging, provide the ability to recover from an incomplete sequence of disk modifications. Such alternatives, while being beneficial to performance, have proved overly burdensome due to loss of media or software compatibility. Accordingly, there is a need for an improved operating system that provides control of write ordering without the performance penalty of synchronous writing and without mandating special hardware, new media formats or other changes.

SUMMARY OF THE INVENTION

The present invention applies to a computer system having data organized in files, having a secondary storage for storing files, having a primary storage, and having one or more types of file subsystems (file system implementations) for controlling transfer of files between primary storage and secondary storage. A subset of writes to secondary storage are performed using a Delayed Ordered Write (DOW) subsystem, which makes it possible for any file system to control the order in which modifications are propagated to disk. The DOW subsystem consists of two parts. The first part is a specification interface, which a file system implementation or any other kernel subsystem can use to indicate sequential ordering between a modification and some other modification of file system structural data. The use of the specification interface by any file system implementation results implicitly in the construction of an order store in primary storage, that records the ordering interdependence among different buffers affected by the modifications. The
second part of the DOW subsystem is a mechanism that ensures that the disk write operations are indeed performed in accordance with the order store. The DOW subsystem accomplishes this task by both intercepting and initiating disk write operations, as needed, to control the actual sequence of writes presented to device drivers.

DOW improves computer system performance by reducing disk traffic as well as the number of context switches that would be generated if synchronous writes were used for ordering. The use of DOW for controlling the order of structural updates does not require a structural redesign of a file system format, or changes to the system hardware or device drivers. Replacing synchronous writing by DOW-maintained ordered writing provides large gains in system performance.

DOW is modular and is loosely coupled with other kernel subsystems, including the file system implementations that use it. It does not require any modifications to other kernel subsystems or to the standard file system interfaces. Any file system implementation in the operating system can use DOW to specify the ordering, if any, among a sequence of structural changes in lieu of forcing synchronous writes. While DOW thus provides an alternate mechanism for regulating the disk writing, the file system implementation retains control of the policy, that is, control of which modifications should be ordered relative to each other.

Thus, DOW allows a computer system to obtain the performance advantage associated with delayed writes while retaining the file system repairability characteristics that come with ordered writing.

The foregoing and other objects, features and advantages of the invention will be apparent from the following detailed description in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a block diagram of a computer system employing ordered operations including, for example, delayed ordered writes to secondary storage.

FIG. 2 depicts a block diagram of the operating system software of the computer system of FIG. 1.

FIG. 3 depicts a block diagram of an example of the processing of a remove (rm) command that causes writes in the system of FIGS. 1 and 2 using synchronous writes.

FIG. 4 depicts a block diagram of a directed acyclic graph for the FIG. 3 example of the processing of a remove (rm) command that causes writes in the system of FIGS. 1 and 2 using delayed ordered writes (DOW).

FIG. 5 depicts a block diagram of an example of the processing of two touch commands and two remove commands that cause writes in the system of FIGS. 1 and 2 using synchronous writes.

FIG. 6 depicts a block diagram of a directed acyclic graph for the FIG. 5 example of the processing of commands that cause writes in the system of FIGS. 1 and 2 using delayed ordered writes (DOW).

FIG. 7 depicts an array of data structures that constitute the operation (node) entries in an ordering store.

FIG. 8 depicts an array of data structures that constitute the order (link) entries in an ordering store.

FIG. 9 depicts the manner in which predecessor and dependent entries are organized using pointers in the ordering store.

FIG. 10 depicts [illegible].

FIG. 11 depicts a block diagram of the processing in which a write request is reserved for an inode buffer represented by a DOW node.

FIG. 12 depicts a block diagram of the processing in which [illegible] are marked pruned and the A node is moved to the dow flush chain.

FIG. 13 depicts a block diagram of the processing in which the C node is moved to the dow flush chain after the buffer write completes.

FIG. 14 depicts a block diagram of the processing in which the E and F nodes are moved to the dow flush chain after buffer write completes for the C node.

FIG. 15 depicts normalized system throughput for users with and without delayed ordered writes.

FIG. 16 depicts disk writes per user with and without delayed ordered writes.

[illegible]

FIG. [illegible] depicts a block diagram of a computer system employing ordered operations in a network environment.

FIG. 20 depicts a block diagram of the processing for the computer system of FIG. 20.

DETAILED DESCRIPTION

Computer System General — FIG. 1

The computer system 3 of FIG. 1 is composed of hardware 5 and software 6. The hardware 5 includes one or more processors 10, typically a central processing unit (CPU), main storage 11, and input/output (I/O) in the form of secondary storage 12 and other input/output devices 13-1, . . . , 13-N. The software 6 includes an operating system 14 and user (application) programs 15. The computer system 3 executes user programs 15 in the hardware 5 under control of the operating system 14. A common instance of the operating system 14 is the UNIX operating system.

[illegible] The UNIX operating system has files [illegible]. The UNIX operating system has three levels of operation: user 21, kernel 20, and hardware 5. Typically, application programs run at user level, and request the services of the operating system by means of system calls. A system call interface 31 and libraries 22 exist for this purpose; the libraries 22 are linked in with the user programs 15 and map system calls to primitives that the kernel 20 recognizes.

The kernel 20 acts as a resource manager for all of the resources of hardware 5 of the computer system 3 of FIG. 1. Primarily the kernel has two functions. The first is to provide a large degree of device independence to the application programs by masking details of the various hardware resources needed during computation. The second is to perform various supervisory and control functions so that the needed resources are scheduled as necessary. The kernel is partitioned into a process control subsystem 35, a file management subsystem 32, a memory management subsystem 38, and other supporting subsystems.

The process control subsystem 35 creates or terminates processes in response to explicit or implicit user requests, controls process behavior, obtains the resources that are
^dSf f l! he f"- * ubsystems for plOCCSS »«nition. Software Corporation. 4800 Great America Parkway, Suite

and provides means for niter-process communication. CMef 420. Santa Clara. Calif. 95054). Also, a file sysSpe tM 

Mnongflter Kources t h at a pW °ccds dun n^execution is used commonly for providing file system^ 

gWof the proces sors' 10 W of primary stor age between nuKhines connects by a nftworUC^™* 
? **** 0n s" 5535 ^" « NetworkRle System, see R. Sandberg, et al., "Deslmand 

u, ano otner racumes sucn at printers. Implementation of the Sun Network Ffle System." Proceed 

I M memory management subsystem 38 regulates the use ings of the USENIX 1985 iJrSS JSSf 

of primary storage for efficient, protected, and convenient June 1985. Since file system iw^STto Sto^^ 

access to programs and data, and coordinates with file specific file system type to^Ky^ulTSs,! 
management subsystem 32 so mat file data is transferred ,o in this specification applies wSb7«Se7wSS 

between primary and secondary storage as needed by pro- identification is important Ch 

cesses. In most UNIX operating system versions, the The transfer of data between primary storaee and «*ri,* 

memory management subsystem provides the convenient era! (I/O) devices sacT^^ZT^^i 

reference *e data or code that they need during execution, is of kernel modules called Device Drivers 34 The device 

UStSfl ^ U emi : l0yed I f0r racetin * ^ ^vers support two models of aSs^o fcc^S 

^cn^T£ DCa ?, y » OT /°—^y ^cnt for a devices. One of the two models is the block dtvKSd 

spcofic ^ocess. Tha s dlusion B supported by managing the 34-1. and is used for devices that can be addressed « 

mainstorage n small units of allocation, called pages, and sequences of blocks, in which the lagth ofT S is 

IST^ ^TT 8 ^""ST 8 brtW€eB ^ 20 ^^512 bytes. TheomerisSeiTcha^aSce 

fntowh^h^r 5 ( "J^l e f Se$) the ^ modcl 34 " 2 *** « ^ used for devices for S Mock 

into which the data accessed by these references reside. The device model does not apply 

pool of pages that are so used, as the variably allocatable The block ? q%1 fa usuaUy applied to disks and 

ZS« tOTn£d a P3ge CICbC appears * RG. lass ie ^SSfagHo^sg Iw g yj- 

oXr^L «,„ * • 25 o^extoreducej&tafflc. The sizes bu ffer s 

Over the duration that the operating system executes a usOT^erforming^nsferTfrcSOT to block devfc^ 

fpeafic process on a processor, the processor is said to be sora co nvenient multiples of Sti^ Z2T£j£?JF l 

process refers to the overall hardware and software state of i s performed eitherbVuiilno «Z» nr .11 n.„., * J. 

a Processor that makes it possible for the memory references 30 cache 41. or to rTgBl^S rftS S S£g?££ 

the physical addresses that contain the Instructions or data and a buff e r cache In cJcs Th^Th Jh^X.f T 3, 

k - „ , 41 me «lme that the process control 35 h eld in the bu ffer cac he" both the tile 
subsystem 35 dfcembades the process from execution on a s ystem 32 L th e mcmm .Za^^ S^^ 
processor. This "saved context" information must be subse- c ooperaovelv man.™ ih* n^^SF SagSgSg «,! 

S;t?; a ^ „ p b ^^^ 40 and the 

ite context, and restoring the context of a second process, so Thl TfiinctionTimplemented by block or character device 

Src&^a^^^'^ drives that kernZbsystems ™T*«£7£Z 

tZI*£™Z? ! ,! ^ data traasfcr 01 000,101 operations are called dcviccdrivcr 

mo^f ^ ? C °T 0nly COmpriSeS ,Me » r routines. Of the*, the functions called k^rto to 

one or more file systems, each of which spans a single 45 write data from primary storage to secondary stomTnr 

logical partition or subdivision of a secondary storage called write stratcgyWtin« ^th^ff JSfcd fa 

medium such as a disk, and which organize the data within order to read date fTm s^ondanr stoas iSo 

^partition. Generally, multiple implementations of file storage arc c^led read ^ Sn s ^ ^ 

S-T, ' ? f 0 ^ 8 " COmm ° n * of Hardwarecontrol39isre S rSnsibSn a ndhnEinternipts 

d1n^„h,L ^ rc, T CT f thC °P era ^ g s y stem and 50 " d ^ communicating with u,e machkrnardwa^T^ 
oAer implementations by virtue of spe- devices 12 and 13, such as disks or tern^als^Ty ktmam 
aalized functions and structure that are appropriate for the the processors (CPU) 10 while a process to 
specific uses intended by the implementation. The common i B te£. P ted. the kernel 20 may re^ exwutio? of 8 «5 

.ofsX?S?fr soc ^^ ed M -^ c ^^^^ 

clSi ^*2n. w£w 198^ STthll^ 8Cn ^ $ f yStem - WidC SCrV,CCS ' SUch as ^^n«on^3 

?v7r V KU^ W ' i . 15 ' (2) ^ (Ar&T) contto1 of networks, line printer spooling, handline of many 

^ThI ^T*- r MaUnCC J - BaCh - ""^ Desi «" °f «^onal stations, and so on 8 f f 

^^£S2S.^£f T ^ Sy J te ^ lh c rARiFly ^m is_a J p a n_or J ub division o7 7 se?ondary 
Wrt^TAO FUe Sy,um AdmmttratorS Guide. VERITAS st^mediu m that is nilcaje^yTHc^sle-m inir^^ 
Bjsktcsa work-stati ons aie also use T TC quch t lv. J fl „^/° * ? and free ' 01 610 «*

ar e locals on a re mote "^TnT?,. rft i " < k «P»<* for any auxiliary data ascribing 

nef wo^For " ,m P ^ 5 the layout and characteristics of files and directories in the 

" XWr , fl n r o thf princlplef apply to anyd<xage3iwcTThe tystenL TheTefore > for simplicity, an overview of data 

fllb are Identified by nanwg, each nam* ic . stttnTof s ^ uetures 811(1 ^ system space management procedures is 

ch aracters. Cer tain filet are dktingiikt,^ „ dfectorlei, and S^" 1 . here for me VFS me system implementation, but the 

they contain names of other files along with pointers to their P™"* 1 " «nd the invention to be detailed apply equally to 

location in the file system. These names are called directory 10 UFS and ,0 omer ffle system implementations. However, to 

entries, and may refer either to plain files or to directories. Prevent ambiguity, the term UFS will be used to qualify the 

Each file name is also called a link". The system permits a description that applies particularly to the UFS file system 

single physical file within a file system to have more than Implementation, 

m lB¥* orUnk In the UFS file system the internal representation of a file 

While conceptually a directory may be considered as " >* given by an inode which contains a description of the 

containing the files for the file names the directory contains, device layout of the file data and other information such as 

fce files named by a directory actually exist separately and file owner, access permissions, and access times. Every file 

arc only referenced bythc directory. By allowing directories (or directory) has one inode, but may have several name! 

to contom entries to other directories as well as plain files, (also called links), all of ^^jE^ieW tofe 

? v ;^ < ^ 8 ^. P0 ^ StolteUSWsapa ^ 20 When a recess crea tor opens a file by name Zes 
ttye in which an enure file system is a tree structured e gch^ponent of fficlileTaSc-Zn S. 

caUed theroot directory of thefile system. The rwt directory d fenes and frLejnev w^ l SHL & > Hfffi 
re ers to a number of files and directories; the directories in SStTBfS ^aSS^^lSimS SZ^ 
tte root Rectory themselves refer to additional files and « an unused inode. Inodes are stored in teflC. and 

i^&xisr*' 811 avaflaWe aes and ass -» ~* ?*> 

UNIX file system implementations do not impose any 1ST ^ ** ^ ^ of 

T^fj^T'T 3 °f C0DtCa ? ° fplain mcs; these <"« A file system in UFS is a sequence of logical blocks The 

Z^toSESZTZZtr^"'^'* 30 ^ ***** M0dc is « ooavenTcnSttTetf me 
S«L„ y y a ^ l,Cat,0n smaU « f sire in which data transfer can be performed to 

Application programs may impose specific semantics on file secondary storage P 10 

Z^nn l ! , a T rdallCe ^ OCCds ' For « anmlc . Bach UFS file system includes a number of special blocks 

some applications may interpret line feed or tab characters which contain file system admlnkm^vr ' 

ffiSw^?*'*^ 33 -tents sT^X^t^Xu^n^Z 
this .nformahon « of do sgnificance to the file system knowledge of application proKrams One of ttem 7« J 

different matter. Directories cont ain information about file Information identifvlnn availaWeln^I «nH T~ 
S^S^^^ 5 (unailocate'd ^StV^^orT^^. 
gBg^^ufflgfoi to tne tiie systcmjmgjementa- 40 pact layouts, UFS divides a file system into smaUerdvis?OTs 
.^f^l^! 0 *" . the . filc j s y stem organization is brought Ordering for Structural Modifications

into memory, examined and altered, and written to second- Manv file svst™ im^^^T? 

an inode into memory to access its dataVand writes the toode S nSS * , UFS ae W«m *ows why 

implementation therefore attejnpiTlo^m^Tn?^ mtmJ?^ I"" 1 me kode ™ identified. 

st iUauial Uiiumauun In me PHg/c^h741 orX S JST * ^f, ^.ff su ^ as owner ^ Protections, and 

^4»,^^ a A^MlTc^i *L^S^ £L^f„ bl0 ^ ^ f * *!* to System > * w «<* the file 

ZT^ Mm ^ " USCTm ^ BuffCTC ^ he40 on " lfae d^ory entry is mSned^ £e SShe wKe J? 

^ y^x'P 001 I* m ° nc " v t> "' t " inode and thefile system iuprtto*3udS^ntrK 

*!■ " ^ St ° nflE ^"S^' 1 "' «** list are maintained^ 2e b3« S A* a 

dinlll^ ystemmirjejpen^ simplification, the term "buffer" will be used to refer to an? 
d ate are cachcq, M me page cachr. 41. while stnlrfnrSnr, ,,. 2J ln-memory copy of data, whether that copy is stored in the 

such as modes, supcrblocks. and indirect block* »rr fry, in buffer cache 40 or the page cache 41 

inTbu ffer cache 4 0; however, a file system i^i^nr.^,. To remove a file, the UES file system performs the steps 

may choose to maintain file da, in fr llffrr cche^ r tn in (he following TABLE 1. v 
mainu. n rtn.r*,.,,! fn ^ ^ rT|rhf fw ^ 

P ° n 7 « ^orithmfc coPvenienc e. 30 TABLE 1 

Read/Write Procedures ' — ' 

'—When reading data from disk of secondary storage 12 the z write !t Sfl^ "ft™™* the file nme. 

k-* first attempts to f md the data in bSpageS* * ^rr^lT^^ 

and docs not have to read from disk If found in cache 33. If reduc « ^ o, tto file wm need to be removed. Zero email 

the requested data block is not in the cache 33, then the 35 P 0 ^ 18 •» described by tbe fiie mode, 

kernel 20 calls the disk device driver 34 to "schedule" a read ^^Z^^™*™ bctWMn ** ^ 

request, and suspends the current process ("puts it to sleep") «• Write the buffer th.t com*™ tbe file mode to disk, 

awaiting the completion of the requested I/O. The disk 3 Return the data bkxks that bei^ed to the'fiie, to the £n» block Ust 

dnver notifies the disk controller hardware to read the data. n * c ^ « "W. 

which the disk controller later transfers to the indicate ^ ' 2? w " ch2n * pd 111 (*>. nmk the 

ttoller interrupts the processor when the I/O is complete, and 8 * Rotum fito "w* «t*ii to a list of frw modes in the file mrem. 

the disk interrupt handler awakens the suspended process; ■ 

the requested data is then in the cache 33 ' ■« *u ^ 

in case there is a subsequent read attempt for the date Whrn {~*a^Za * of 7 (step 2 > to disk, before the 

writing, the kernel ^U^SSSSS'^ SlTt J^SZ - } t0 or re ^8ned 

writes by deterraining whetoer the data b!Z 2 Ut , T ° thcnwsc ' ^ me VHem crashes before the 

Tbeu^stylesofdiskwritco^radonsdescribedeariier: wSlS^ V/^ 

synchronous, asynchronous, and delayed writes have dWrr lu ?i • rcassi 8ned To the user, it would appear that the^ 

i^te^ca^/p^^ " 

If the write is delayed, then the page or the buffer of cache h^i Jf d 40(1 " auh ha PPens 

write pages or buffers in c^hc ^IZ^cZ^ut V . have taplemented this ordering 

(secondary storage S ^ Ae im^m^, « I « <Usk «y««*«»«>»« writes. In the example of TABLE 1, the 

UrS^m^ ^^ 65 ^s^s2and4aredone S yncrZously;«hus JuruVg 
In a synchronous write, the process issuing a disk write
waits for the write to complete before continuing execution.
Because typical delays in writing to disks are of a magnitude
comparable to the time it would take for the processor to
execute thousands of instructions, when a process needs to
wait for a disk write to complete, the process control
subsystem reassigns the processor to some other process that
is ready to continue. Thus, synchronous disk writes give rise
to context switches.
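
As a rough illustration of this cost, the sketch below models a buffer cache in which a synchronous write reaches the disk on every call, while a delayed write (as described in the background section) only marks the buffer as modified until a later flush. The class and method names (`BufferCache`, `write_sync`, `write_delayed`, `flush`) are hypothetical and are not taken from any kernel. Counting one context switch per synchronous write is a simplifying assumption.

```python
class BufferCache:
    """Toy model of a buffer cache; not a real kernel interface."""

    def __init__(self):
        self.dirty = set()         # buffers marked as needing to be written
        self.disk_writes = 0       # writes actually issued to the disk
        self.context_switches = 0  # times the caller had to wait for the disk

    def write_sync(self, buf):
        # Synchronous: the write is issued now and the caller waits for it,
        # so the processor is handed to another process in the meantime.
        self.disk_writes += 1
        self.context_switches += 1
        self.dirty.discard(buf)

    def write_delayed(self, buf):
        # Delayed: only mark the buffer; nothing reaches the disk yet.
        self.dirty.add(buf)

    def flush(self):
        # Later, each modified buffer is written once, however many
        # times it was modified since the previous flush.
        self.disk_writes += len(self.dirty)
        self.dirty.clear()


sync_cache = BufferCache()
for _ in range(3):
    sync_cache.write_sync("directory buffer")

delayed_cache = BufferCache()
for _ in range(3):
    delayed_cache.write_delayed("directory buffer")
delayed_cache.flush()

print(sync_cache.disk_writes, sync_cache.context_switches)        # 3 3
print(delayed_cache.disk_writes, delayed_cache.context_switches)  # 1 0
```

Three modifications of the same buffer cost three disk writes when written synchronously, but only one when the writes are delayed and flushed together. The delayed variant gives no control over the order in which different buffers reach the disk.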

Some consequences of using synchronous writes are
described in connection with an example which consists of
the execution of the UNIX system command:

rm file1 file2 file3
where the command (rm) removes three files (file1, file2,
file3) in a single directory. For purposes of this example, the
directory entries for all three files are in the same buffer of
the directory, as would be typical for moderate sized direc-
tories. Also, in the example, the inode data for file2 and file3
are stored in the same buffer while the inode data for file1
is in a separate buffer.
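
The buffer layout assumed in this example can be tallied with a short sketch. It assumes, following the removal procedure of TABLE 1, that each file removal synchronously writes the directory-entry buffer and then the inode buffer. The buffer labels and the helper `synchronous_writes` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical labels for the three buffers in the example.
DIRECTORY_BUFFER = "directory buffer (entries for file1, file2, file3)"
INODE_BUFFER_1 = "inode buffer (file1)"
INODE_BUFFER_23 = "inode buffer (file2, file3)"

inode_buffer_of = {
    "file1": INODE_BUFFER_1,
    "file2": INODE_BUFFER_23,
    "file3": INODE_BUFFER_23,
}


def synchronous_writes(files):
    """List the synchronous disk writes, in order, for removing each file."""
    writes = []
    for name in files:
        writes.append(DIRECTORY_BUFFER)       # directory entry goes to disk first
        writes.append(inode_buffer_of[name])  # then the buffer holding the inode
    return writes


writes = synchronous_writes(["file1", "file2", "file3"])
per_buffer = Counter(writes)

print(len(writes))                    # 6
print(per_buffer[DIRECTORY_BUFFER])   # 3
print(per_buffer[INODE_BUFFER_23])    # 2
print(len(per_buffer))                # 3
```

Under these assumptions only three distinct buffers are involved, yet six disk writes are issued because every write is forced out individually.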
Operation Of rm Command — FIG. 3 

FIG. 3 illustrates the operation of the above rm command 
in a UFS file system implementing ordering by using 
synchronous writes. It shows the ordering requirements for 
each file, based on the implementation of file removal 
procedure described in TABLE 1. FIG. 3 corresponds to a 

sequence of requests R1, R2, . . . , R9. Note that for each file, the order dependencies are that the buffer containing the directory entry must be written before the buffer containing the inode, and the buffer containing the inode must be written before the data blocks are returned to the free list. In addition, FIG. 3 shows the additional order dependencies that each file is handled in sequence; all the actions for one file are performed before the actions for the next file and so on. In general the order dependencies D1, D2, . . . , D8 for requests R1, R2, . . . , R9 of FIG. 3 are as follows:

D1: R1->R2
D2: R2->R3
D3: R3->R4
D4: R4->R5
D5: R5->R6
D6: R6->R7
D7: R7->R8
D8: R8->R9

The order dependency D1, for example, means in the expression R1->R2 that R1 is to be performed before R2. The eight order dependencies D1, D2, . . . , D8 result from a simple time-order presentation of the requests R1, R2, . . . , R9 so that a single time-order subset of order dependencies is formed, D1->D2->D3->D4->D5->D6->D7->D8, meaning that the requests are specifically ordered as R1->R2->R3->R4->R5->R6->R7->R8->R9.

As shown in FIG. 3, there are two synchronous writes per file (requests R1, R2; R4, R5; R7, R8), resulting in a total of six writes to disk and six context switches. Note that one buffer contains all three directory entries (requests R1, R4, R7), and is thus written to disk three times, and, similarly, another buffer contains the inodes (requests R5, R8) for both file2 and file3 and is written to disk twice.

Comparison of Synchronous, Asynchronous, and Delayed Disk Writes

Asynchronous writes and delayed writes are two alternatives to synchronous writes. TABLE 2 provides a comparison of the three write operations used in the kernel.

TABLE 2

                                        Synchronous   Asynchronous      Delayed
1. Writes to Disk When?                 Immediately   Immediately       Later
2. Data Integrity                       High          High              Medium
3. Ratio to Actual Disk Writes          1:1           Marginally > 1:1  Many:1
4. Waits for Disk Write to Complete?    Yes           No                No
5. Can be used for Ordering?            Yes           No                No
6. Causes Context Switches?             Yes           No                No
7. Disk Throughput Limits Program?      Yes           Somewhat          Minimal

As shown in TABLE 2, each write operation provides a different tradeoff among various characteristics. Synchronous and asynchronous writes provide greater integrity on secondary storage in the event of machine failure (see line 2 of TABLE 2), while the use of delayed writes gains the benefit of improved performance but is more vulnerable to a system crash. Since delayed writes minimize the coupling between disk subsystem and the CPU (line 7), as well as reduce the actual number of disk writes by promoting write caching (line 3), they tend to be best suited for achieving high system throughput; and since they do not cause extra context switches (line 6), they improve individual response times as well.

Delayed Ordered Writes (DOW)

The Delayed Ordered Write (DOW) subsystem 30 of FIG. 2 of the present invention provides a more efficient solution to the disk write ordering problem. DOW, implemented in one embodiment in a UNIX Operating System, doubles


system performance by reducing the amount of disk traffic 
as well as the number of context switches generated by 
synchronous writes. DOW provides a mechanism for con- 
trolling the order in which modifications of file system 
structural data are recorded on disk, without using the 
one-disk-write-at-a-time style of synchronous disk writing. 
Large gains in system performance have resulted from using 
DOW in place of synchronous writes within the UFS file 
system implementation. These advantages are obtained
without requiring a structural redesign or a change in the 
media format for UFS. 

DOW includes two parts. The first part is an interface by 
which file system implementations, or any kernel subsystem, 
specify the sequences in which modifications of file system 
data blocks can be recorded on disks. These sequences 
translate into ordering dependencies among disk blocks 
themselves, which are collectively represented by an order- 
ing graph (entries in an ordering store), prepared by DOW 
in response to the specification. The second part of DOW 
consists of mechanisms responsible for ensuring that the 
operations of the ordering graph (indicated by the entries in 
the ordering store) are performed in the order specified. 
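
The two parts described above can be sketched as a small in-memory graph. This is only an illustrative model under stated assumptions: the names `OrderingStore`, `specify_order`, and `execution_order` are hypothetical, the node labels are invented for the rm example, and a topological sort is used here as a stand-in for the actual DOW mechanism, which intercepts and initiates disk writes rather than computing a full order in advance.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class OrderingStore:
    """Toy ordering store: nodes are operations, links are 'first before then'."""

    def __init__(self):
        # Maps each node to the set of nodes that must complete before it.
        self._predecessors = {}

    def specify_order(self, first, then):
        # Part one: the specification interface records one ordering link.
        self._predecessors.setdefault(first, set())
        self._predecessors.setdefault(then, set()).add(first)

    def execution_order(self):
        # Part two (simplified): produce some order that honors every link.
        # TopologicalSorter raises graphlib.CycleError if the links form a cycle.
        return list(TopologicalSorter(self._predecessors).static_order())


store = OrderingStore()
store.specify_order("write directory buffer", "write inode buffer (file1)")
store.specify_order("write directory buffer", "write inode buffer (file2, file3)")
store.specify_order("write inode buffer (file1)", "free blocks of file1")
store.specify_order("write inode buffer (file2, file3)", "free blocks of file2")
store.specify_order("write inode buffer (file2, file3)", "free blocks of file3")

order = store.execution_order()
for step in order:
    print(step)
```

In this model each buffer appears as a single node however many files depend on it, so one write of the directory buffer satisfies all three removals. The exact order printed may vary among valid orders; only the recorded constraints are guaranteed.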

DOW is a modular service, loosely coupled with other
kernel subsystems, including the file system implementations
which use it. In the preferred embodiment, no modifications
are made to other kernel subsystems (including
device drivers) or to the standard file system interfaces. The
changes to the file system implementations that choose to
employ the DOW mechanism are simple in-place code


Any file system implementation in the operating system
can therefore use DOW easily, without structural redesign
and the loss of media compatibility that might otherwise
result. Furthermore, while DOW provides the mechanism
for ordering disk write operations, the file system retains
control of the policy for ordering, that is, which disk write
operations should be ordered and how.

Terminology and Conventions

For an exposition of how the DOW subsystem 30 works,
it is useful to represent the actions to be performed
in a time order as nodes (operation entries) of a graph and
to represent the time ordering constraints as directed links
(order entries) between the nodes. The nodes and links in the
graph are stored as entries located in an ordering store, for
example in system memory such as main store 11 of FIG. 1.

The nodes in the ordering graph are of two types: a node
that represents a write of some data to disk, and a node that
represents the execution of some function that must be ordered
relative to writing of some data to disk. For example, in FIG.
3, the functions which return the disk blocks of the removed
files to the file system free list (requests R3, R6, R9) are
represented in a DOW ordering store by nodes of the second
type. A node of the second type is called a function node, and
the ordered function call that is represented by a function node
is called a deferred procedure call.

For convenience of description, the nodes in an ordering
graph are identified symbolically as N1, N2, ..., Nn. An
ordering constraint between two nodes N1 and N2 of the
ordering graph, such that the action corresponding to N1
must happen before the action corresponding to N2 can
happen, is represented by a link directed from N1 to N2, or,
in text, as N1→N2.

The execution of a node in the ordering graph refers to the
action of performing the task that is represented by that
node. Thus, executions of DOW graph nodes result either in
the writing of a particular data item to secondary storage, or
in the case of function nodes, the execution of the procedure
call that the function node represents.

For two nodes N1 and N2 in the ordering graph, if a link
N1→N2 exists, then N1 is called a predecessor of N2 since
N1 must be executed before N2 can be executed.
Alternatively, since the execution of N2 depends upon the
execution of N1, N2 is called a dependent of N1.

As defined above, the ordering graph is directed, that is,
any link between two nodes in the ordering graph is directed
from one node to the other, to represent the ordering
requirement between the actions that correspond to the two
nodes. In addition, an important characteristic of the
ordering graph is that at all times, it is free of cycles. If a cycle
existed, for example, among 3 nodes N1, N2, and N3 of the
ordering graph due to links N1→N2, N2→N3, and N3→N1,
then the ordering constraints suggested by the three links
would be self-contradictory, since at least one of the three
constraints that the links represent would be violated in any
order in which N1, N2, and N3 are executed. Therefore the
ordering graph does not contain cycles, and is said to be a
directed acyclic graph.

Reducing Disk Writes with Delayed Ordered Writes

Delayed Ordered Writes (DOW) combines delayed writes
with an ordering store mechanism for controlling the order
in which data is written to disk. Thus, the use of DOW
allows the system to obtain the performance improvement
associated with delayed writes while retaining the file
system recovery capability that is provided by synchronous
writes.

FIG. 3 is analyzed to understand the transformation from
synchronous writes to delayed ordered writes. The boxes A
and C in FIG. 3 represent disk blocks having contents
updated in more than one sequence of ordered modifications.
The one-at-a-time handling of file removals which occurs as a
result of synchronous writes, as described above in connection
with FIG. 3 and TABLE 1, is not necessary for repairability,
so the sequencing shown in FIG. 3 between boxes R3 and
A(R4), and between boxes R6 and A(R7), can be removed.
With these changes, the control flow of FIG. 3 is transformed
into a directed acyclic graph.

Directed Acyclic Graph - FIG. 4

The directed acyclic graph of FIG. 4 is obtained by merging
the disk write requests of FIG. 3 for the same buffer that are
common to multiple sequences into a single disk write of the
buffer. Merging common data into a single write in this example
reduces the number of separate disk writes. [...]

The function that returns the disk blocks of a removed file to
the free list must also be ordered relative to the delayed write
of the inode buffer. This ordering is achieved by creating a
graph node which corresponds to the inode buffer, creating
another node which corresponds to the procedure that would
free the disk blocks for the inode, and then specifying that
the procedure can be invoked only after the inode buffer is
written to disk.

In contrast to the synchronous write operations of FIG. 3, the
ordering of FIG. 4 is only partially sequential and represents
a partial ordering among multiple file system structure
modifications, some of which can, therefore, execute
concurrently. Specifically, in FIG. 4, the common order
dependencies CD1 and CD2 for a first subset are as follows:
CD1: A→B
CD2: B→D
The common order dependencies CD3, CD4, and CD5 for
a second subset are as follows:
CD3: A→C
CD4: C→E
CD5: C→F
Note that the first and second subsets of common order
dependencies for FIG. 4 are independent of each other.

Four Process Example - FIG. 5

The operation of delayed ordered writes is further
illustrated by a second example in FIG. 5. In this example, it is
assumed that four processes operate separately on files in a
directory. Two of the processes each create a file by using the
UNIX touch command:

touch filename

which creates a zero-length file called filename if it does not
already exist in a directory. The other two processes each
remove a file from the directory using the UNIX rm command.
The four processes execute these commands, in the same
directory:

Process 1      Process 2      Process 3      Process 4

touch file1    touch file2    rm file3       rm file4

When creating a file name in a directory, a process 
increments the link count in the inode for the file and adds
the file name to the directory. It is only after the increased 
link count is safely written to disk that the file name itself 
can be added to the directory buffer. Otherwise, an inter- 
vening system crash would cause the file inode to have a 
smaller link count than the number of actual directory 
entries referencing the file so that a file system repair utility 
could not correctly determine a proper recovery action, and 
so would be forced to leave the file in the directory from 
which it was being unlinked prior to the system crash. 
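
The ordering requirement in the paragraph above can be checked with a small model. The following is only an illustrative sketch, not code from the DOW implementation: the `disk_state` structure, the `write_kind` values, and the function names are hypothetical, and a crash is assumed to be possible between any two writes. The sketch treats a state as repairable when the on-disk link count is not smaller than the number of directory entries naming the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of what is on disk for one file: the inode link
 * count and the number of directory entries that name the file. */
struct disk_state {
    int link_count;
    int dir_entries;
};

enum write_kind { WRITE_INODE, WRITE_DIRECTORY };

/* Assumption of this sketch: a crash is repairable when the on-disk
 * link count is at least the number of on-disk directory entries. */
static bool repairable(struct disk_state s)
{
    return s.link_count >= s.dir_entries;
}

/* Apply the writes for a file creation in the given order, treating
 * the point after each write as a possible crash. Returns true only
 * if every such crash point leaves a repairable state. */
bool create_order_is_safe(struct disk_state start,
                          const enum write_kind *order, size_t n)
{
    struct disk_state s = start;

    assert(order != NULL || n == 0);
    if (!repairable(s))
        return false;
    for (size_t i = 0; i < n; i++) {
        if (order[i] == WRITE_INODE)
            s.link_count++;      /* increased link count reaches disk */
        else
            s.dir_entries++;     /* new file name reaches disk */
        if (!repairable(s))
            return false;
    }
    return true;
}
```

In this model, writing the inode before the directory entry is safe at every crash point, while the reverse order passes through a state with more directory entries than links.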

FIG. 5 illustrates the steps followed by each of the 
processes in accordance with the ordering dependencies that 
apply to creating and removing files. For purposes of this 
example, assume that the inodes for file1 and file2 are in one
buffer and that the inodes for file3 and file4 are in another
buffer. Also, assume that the directory entries for file1, file2
and file3 are in the same page buffer; and the directory entry
for file4 is in another buffer.

The four processes run independently. The time order of
each update request per process is important and the overall
time order of all the requests will be some interleaving of all
the per-process requests. The overall time order is
unimportant, so long as the order dependencies of update
requests within each process are preserved. The order
dependencies of FIG. 5 are as follows:
D1: R1→R2
D2: R3→R4
D3: R5→R6
D4: R6→R7
D5: R8→R9
D6: R9→R10
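
The statement that any interleaving is acceptable as long as the per-process order dependencies hold can be expressed as a simple check. The sketch below is illustrative only; the `dep` structure and the function names are hypothetical, and requests R1 through R10 are represented by the integers 1 through 10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One "before -> after" constraint between request numbers. */
struct dep {
    int before;
    int after;
};

/* D1..D6 as listed for FIG. 5. */
static const struct dep fig5_deps[] = {
    {1, 2}, {3, 4}, {5, 6}, {6, 7}, {8, 9}, {9, 10}
};

/* Position of request r in order[], or -1 if it does not appear. */
static int position_of(const int *order, size_t n, int r)
{
    for (size_t i = 0; i < n; i++)
        if (order[i] == r)
            return (int)i;
    return -1;
}

/* Returns true when, for every dependency, the "before" request
 * appears earlier in order[] than the "after" request. */
bool order_respects_deps(const int *order, size_t n,
                         const struct dep *deps, size_t ndeps)
{
    assert(order != NULL && deps != NULL);
    for (size_t i = 0; i < ndeps; i++) {
        int b = position_of(order, n, deps[i].before);
        int a = position_of(order, n, deps[i].after);
        if (b < 0 || a < 0 || b >= a)
            return false;
    }
    return true;
}
```

An interleaving such as R1, R3, R2, R5, R8, R4, R6, R9, R7, R10 passes the check, while any order that places R2 before R1 fails it.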
Directed Acyclic Graph — FIG. 6 

FIG. 6 shows how the ordering requirements of FIG. 5 can
be met by using DOW so that it is possible to combine the
common elements among the steps taken by the four
processes. While the resulting overall order among all the steps
is different, the sequencing of steps within each process is
preserved. Specifically, the common order dependencies of
FIG. 6 preserve the order dependencies of FIG. 5 while
reducing the number of separate operations required.
Specifically, in FIG. 6, the common order dependencies are:
CD1: G→H
CD2: H→J
CD3: I→J
CD4: J→K
CD5: J→L
The common order dependencies CD1, CD2, ..., CD5
of FIG. 6 preserve the order dependencies D1, D2, ..., D6
of FIG. 5.
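
One way to see that the links of FIG. 6 admit a valid execution order is to compute a topological order of the graph. The sketch below is an illustration under stated assumptions, not the DOW implementation: the node numbering (`NODE_G` through `NODE_L`), the `link` structure, and `topo_order` are hypothetical names, and the routine simply picks the lowest-numbered node that has no unexecuted predecessor.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16

/* A directed link: "from" must execute before "to". */
struct link {
    int from;
    int to;
};

/* Hypothetical numbering of the FIG. 6 nodes. */
enum { NODE_G, NODE_H, NODE_I, NODE_J, NODE_K, NODE_L, FIG6_NODES };

/* CD1..CD5 as listed for FIG. 6. */
static const struct link fig6_links[] = {
    {NODE_G, NODE_H}, {NODE_H, NODE_J}, {NODE_I, NODE_J},
    {NODE_J, NODE_K}, {NODE_J, NODE_L}
};

/* Writes one valid execution order into out[] and returns how many
 * nodes were ordered. A result smaller than nnodes means the links
 * contain a cycle, so no complete order exists. */
size_t topo_order(size_t nnodes, const struct link *links,
                  size_t nlinks, int *out)
{
    int pending[MAX_NODES] = {0};  /* unexecuted predecessors per node */
    int done[MAX_NODES] = {0};
    size_t count = 0;

    assert(nnodes <= MAX_NODES);
    for (size_t i = 0; i < nlinks; i++)
        pending[links[i].to]++;

    for (;;) {
        int picked = -1;
        for (size_t n = 0; n < nnodes; n++) {
            if (!done[n] && pending[n] == 0) {
                picked = (int)n;
                break;
            }
        }
        if (picked < 0)
            break;
        done[picked] = 1;
        out[count++] = picked;
        /* Executing a node removes its links to its dependents. */
        for (size_t i = 0; i < nlinks; i++)
            if (links[i].from == picked)
                pending[links[i].to]--;
    }
    return count;
}
```

Removing a node's outgoing links once it has executed mirrors the way completed operations release their dependents; a set of links forming a cycle yields no executable node at all.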

Constructing An Acyclic Ordering Graph 

The potential I/O reduction from delayed ordered writes
relies on combining delayed writes with a control mechanism
based on an order store storing entries, which constitute
a directed acyclic graph, in which the graph represents
ordering dependencies among data modifications. TABLE 3
presents a summary of the DOW procedures that are
available to file system implementations for use in constructing
the ordering graph in the ordering store.

TABLE 3

Routines for Constructing an Ordering Graph

dow_create        Create, if it does not exist, a node in the ordering
                  graph which corresponds to either a delayed write
                  buffer or a deferred function execution, and
                  return an integer identifier which may be used to
                  reference that node.
dow_order         Specify an ordering between two nodes in the
                  ordering graph.
dow_startmod      Indicate that data in a buffer is about to be modified.
dow_setmod        Indicate either that modification of data in a buffer
                  or the setup for a deferred function execution,
                  has completed.
dow_rele          Release a hold on a node identifier.
dow_abort_range   Destroy all nodes in the graph corresponding to
                  a range of buffers in a particular file.


In TABLE 4 below, a code sketch in C language illustrates
the transformation from the example of FIG. 3 to the
ordering graph of FIG. 4 using the routines of TABLE 3.


TABLE 4

File Removal using Delayed Ordered Writes

Iteration
1       2       3            Operation

                          1  dowid_t pgdow, ibdow, fdow;
                          2  /*
                          3   * make a node in the ordering graph corresponding
                          4   * to directory buffer which contains the directory
                          5   * entry of the file being removed, and then clear
                          6   * the file's directory entry
                          7   */
A       (A)     (A)       8  pgdow = dow_create(directory buffer containing entry);
                          9  dow_startmod(pgdow);
                         10  .
                         11  . clear the portion of the directory buffer containing
                         12  . the directory entry of the file being removed.
                         13  .
                         14  dow_setmod(pgdow);
                         15
                         16  /*
                         17   * make a graph node corresponding to the
                         18   * inode buffer, and set up ordering to write
                         19   * the inode buffer to disk after the directory
                         20   * buffer is written to disk
                         21   */
B       C       (C)      22  ibdow = dow_create(inode buffer);
A->B    A->C    (A->C)   23  dow_order(ibdow, pgdow, 0);
                         24  dow_startmod(ibdow);
                         25  .
                         26  . Decrement the inode link count and zero out
                         27  . the inode data block pointers
                         28  .
                         29  dow_setmod(ibdow);
                         30  /*
                         31   * set up ordering: call function to free
                         32   * the blocks after writing the inode
                         33   */
D       E       F        34  fdow = dow_create(function to free data blocks,
                         35                    block list specification);
B->D    C->E    C->F     36  dow_order(fdow, ibdow, 0);
                         37  dow_setmod(fdow);
                         38  .
                         39  . mark the inode as free and return the inode
                         40  . to the free inode list
                         41  .
                         42  /*
                         43   * release the node identifiers
                         44   */
                         45  dow_rele(pgdow);
                         46  dow_rele(ibdow);
                         47  dow_rele(fdow);

TABLE 4 shows a file removal that uses the DOW
facility. The code segment appears in the right column and is
executed once for each of three file removals. For
each of the three iterations, the execution of the code in the
right column results in the construction of the ordering graph
of FIG. 4. In the left column, at lines 8, 22, 23, 34, and 36,
are three comma separated entries; these entries identify the
nodes or links in the ordering graph that result from the
execution of the corresponding lines of code in the right
column during the respective iterations.

In the code sketch, line 1 is a declaration of three DOW
identifiers, called pgdow, ibdow, and fdow; respectively,
these variables are used to store identities of graph nodes
corresponding to disk write of a page, disk write of a buffer,
and a deferred procedure call. Lines 2-7, 16-21, 30-33, and
42-44 are comments.

During the first iteration: At line 8, the file system
implementation requests the DOW subsystem via dow_create
to map a directory page in which the entry for file1 exists, to
a DOW node. The result of this mapping is to have in the
variable pgdow the identifier of that DOW node, which is the
node A in FIG. 4. Before modifying the directory page, the file
system implementation signals its intent to the DOW
subsystem by issuing the dow_startmod call (line 9) for pgdow.
After that, in lines 10-13, the actual modification occurs, the
code for which is not different in absence of DOW and is not
shown for brevity. Then at line 14, the file system
implementation calls dow_setmod for pgdow to signal to the
DOW subsystem that the modification is complete.

At line 22, dow_create is called again for creating a
second DOW node, corresponding to the buffer that contains
the inode for file1. The result is to have the variable
ibdow contain the identifier of the DOW node, shown as node
B in FIG. 4, for the inode buffer, which will need to be
modified and written to disk. At line 23, the ordering
constraint between pgdow and ibdow is specified via
dow_order. The effect of this is to insert a link directed from
pgdow to ibdow, that is, from A to B in terms of FIG. 4.

As in the case of pgdow, the intent to modify the inode
(and hence the buffer which contains the inode) is signaled
at line 24 by the call to dow_startmod for ibdow. Again, the
actual modification is not shown for brevity because it is
independent of DOW usage, and happens in lines 25-28.
Then, at line 29, the indication that the buffer is modified as
desired is given to the DOW subsystem via the call to
dow_setmod.

At line 34, the file system implementation calls
dow_create to request the creation of a function node. The
deferred procedure call to which the function node is
mapped is one that would need to be executed in order to
return the disk blocks of file1 to the file system free list.
This creates the node D for the ordering graph shown in
FIG. 4. At line 36, the ordering constraint between ibdow and
fdow is specified via dow_order. The effect is to insert a link
directed from ibdow to fdow, that is, from B to D in terms
of FIG. 4.

With the ordering constraint established at line 36, the file
system implementation has completed the setup of the
function node; hence dow_setmod is called at line 37 to
signal the readiness of the function node for execution. [...]

Actually, the node pgdow could be released at any time after
line 24. The DOW nodes, pgdow, ibdow, and fdow, are
released in lines 45-47.

During the second iteration: As in the case of the first
iteration, at line 8, the file system implementation requests
the DOW subsystem (via dow_create) to map a directory
page in which the entry for file2 exists, to a DOW node.
Since the entry for file2 is in the same directory page as for
file1, the result of this mapping is to have in the variable
pgdow the same DOW identifier as in the first iteration; that
is, the node A in FIG. 4. At line 22, when a DOW node for
mapping the buffer containing the inode for file2 is
requested, the result is that a new identifier is created and
written into the variable ibdow. This new node is shown as
the node C in FIG. 4, and in the left hand column in TABLE
4.

At line 23, the ordering constraint specified between
pgdow and ibdow via dow_order results in the directed link
from A to C in terms of FIG. 4. At line 34, a new function
node is requested for mapping to the deferred procedure call
that would release the disk blocks from file2 to the file
system free list. This creates the node E for the ordering
graph shown in FIG. 4; the identifier for the node is kept in
the variable fdow. At line 36, the ordering constraint
between ibdow and fdow is specified. The effect is to insert
a link directed from ibdow to fdow, that is, from C to E in
terms of FIG. 4.

During the third iteration: The third iteration, for the
removal of file3, proceeds analogously to the first two
iterations. In this iteration, the directory entries for file3 and
file2 share the same page and the inodes for the two files
reside in the same disk block. Hence, at lines 8 and 22, the
same DOW identifiers are returned for the two iterations. At
line 23, the ordering constraint that is specified between
pgdow and ibdow results in no new work since the ordering
link from A to C was already created in iteration 2.

At line 34, a new function node is requested for mapping
to the deferred procedure call that would release the disk
blocks from file3 to the file system free list. This creates the
node F for the ordering graph shown in FIG. 4, and at line 36
the link from C to F is inserted.

Thus when all three iterations are completed, the DOW
subsystem has the necessary ordering requirements between
the various disk writes and the deferred procedure calls in
place, to resemble the ordering relationships of FIG. 4.

File system implementations typically use these DOW
functions in conjunction with modifying structural data. For
the example in TABLE 4, the modifications of structural data
occur in lines 10-13, 25-28, and 38-41, identically with the
usual synchronous write based implementation. With the
synchronous write based ordering, the modifications are
followed by disk writes; with DOW, the modifications are
followed by the primitives that construct the ordering graph.

The general outline followed by a DOW client when
modifying a datum which must be written to disk in an
ordered fashion is described in TABLE 5 as follows:

TABLE 5

1. Create a graph node corresponding to the datum and
   acquire a node identifier referencing the node by
   calling dow_create.
2. If there is a datum which must be written prior to this datum,
   use dow_order to specify an ordering between the two nodes
   representing these two data items.
3. Call dow_startmod to indicate that this datum is about to
   be modified.
4. Modify the datum.
5. Signal that the modification is complete, by using dow_setmod
   to mark the graph node as "modified".
6. Finally, at any point after the last dow_order or dow_setmod
   call involving this graph node, release the graph node through
   a call to dow_rele.

The functions dow_startmod and dow_setmod provide
coordination points between a client of DOW and the DOW
subsystem. They are used to notify the DOW subsystem that
the client is modifying a page or a buffer that corresponds to
a node in the ordering graph.

One key aspect is the rule:

When establishing an ordering between first and second
operations (between first and second common writes or
a first common write and a function), dow_order is
called only after dow_setmod for the first common
operation has completed, but before dow_startmod for
the second common operation is called.

The reason for the rule can be understood as follows:

1. After dow_order is called, the associated writes of data
can occur at any time. If the call to dow_order takes place
before modifying the first data, then the first data may be
written to disk before its modification is effective. The
ordering requirement is violated if the second data is then
written to disk.

2. Until dow_order is called, the writes of each data are
unordered relative to each other. If the second data is
modified before the call to dow_order, the second
modification can become effective on disk before the
modification of the first data has propagated to disk.

DOW Subsystem Data Structures

The data structures used for representing the DOW nodes
(operation entries) and their ordering linkage (order entries)
exist in an ordering store. The DOW ordering store [...]

Fields For Operation Entries Of Ordering Store (DOW Node) - FIG. 7

Each dow-node contains information identifying the
operation specified by a graph node as well as other
information about the dow-node itself. Storage for dow-nodes is
obtained from a statically configured array, as shown in FIG.
7. Each node is identified by its dow-id, which is its index
in this array. For convenience and memory economy, linked
lists of dow-nodes are constructed using dow-ids rather than
memory addresses.

In FIG. 7, the fields of dow-node ordering store entries for
node N1 and N2 from an array N1, N2, ..., Nn of nodes
are shown. Each node includes the following fields:
Free list linkage.
  Linkage for a doubly linked list of unused dow-nodes.

Hash list linkage.
  Linkage for a doubly linked list of in-use dow-nodes,
  hashed by identity. Used to speed searches for a dow-node
  for a particular operation.

Flush chain linkage.
  The Flush Chain Linkage (FCL) is a linkage for the
  doubly linked list of dow-nodes whose operations are to be
  executed.

[...]

DOW state.
  Tracks state information such as whether the dow-node's
  operation is ready to be initiated, whether the operation has
  been initiated, whether the operation has completed, or
  whether the node's predecessors have been searched.

[...]
  A field used for counting.

Type.
  The type of operation represented by the node [...]

Identity.
  Parameters of the delayed operation that the dow-node
  represents. [...] For a buffer write, the identity field
  specifies a device number [...] which identifies the buffer.
  [...]

For each dow-node, there are two linked lists of dow-link
elements (DLEs), one for its predecessors and one for its
dependents. [...]

In FIG. 7, the fields of DOW link elements D1, D2, ..., Dm
are shown. Each entry includes the following fields:

Linking element.
  [...]

DOWID.
  [...]

For a link from A to B, the two DLEs, one on the dependent
list for A and the other on the predecessor list for B, are
called inverse-links of each other. A DLE contains an
additional field identifying its inverse-link. Although there are
two DLE lists (one for predecessor dow-nodes and the other
for dependent dow-nodes) for each dow-node, the lists are not
anchored at any field of the dow-node structure. The anchors
for these lists are provided by two special arrays of DLEs.
This structure is shown in FIG. 9. Placing the storage for the
heads of the lists in special DLE arrays as shown in FIG. 9
simplifies the linked list management. Except for this detail,
the anchors can be viewed simply as parts of the dow-node
structure that are given separate storage.

Example Of Dow-Node And DLE Structures - FIG. 10

FIG. 10 contains an example of the organization of the
dow-node and DLE structures for representing the ordering
graph of FIG. 4. In this example, the anchors for the
predecessor and the dependent dow-nodes lists are treated as
extensions of the dow-node as a simplification for purposes
of explanation.

Consider the dow-node for A. Its predecessor list is empty; its
dependent list contains DLEs 1 and 2. The dow-id for DLE
1 is the dow-node for B while the dow-id for DLE 2 is the
dow-node for C. In this way, B and C are identified as
dependents of A. Next consider the dow-node for B. [...] Its
dependent list consists of only one element, namely DLE 4.
The dow-id field of DLE 4 identifies D as the sole dependent
of B. In a similar manner, the other nodes in the organization
of FIG. 10 represent the ordering links of FIG. 4.

Executing the Ordering Graph

In addition to primitives for constructing the ordering
graph, [...] When an operation is requested on a buffer
corresponding to a node in the ordering graph, the operations
that must precede the requested operation are performed before
the requested operation and in the proper order. [...]

For example, suppose a chain of links,
{A→B, B→C, C→D}
exists among nodes A, B, C and D. [...] The policy followed
in the DOW

5,666,532

subsystem, and is based on some optimZy criLS ^ JSSLSSm ' nodcs «■ 

the operation thai is represented by^Snto ZfnSS ™ * d ° Same with 

& a Ph is completed, the DOW subsystem removes the liniJ , £^T B ^ 

between the node and its dependent nodes, since the comple- .5 ^ f? nodes have 110 Predecessors, it places 
tion of that operation removes the ordering coiistiamt that a™ ° D dow - flufih chain for execution by the dow flush 
existed between that operation and others needing to follow ! m0D : 

it So when node A, in the above example, is executed the 7116 dow - flush -- daeitton checks the dow _jaush__chain 
link A~>B can be removed; this then permits the link D-»A Periodically, and for each node on the flush chain the 
to be added without creating a cycle. Thus, the DOW 10 dow -*u*h daemon initiates the needed deferred operation 
subsystem may initiate the execution of the or more DOW a deferred operation on a buffer completes, the 

nodes to occur, in order to accommodate a new ordering D0W subsystem receives control via a callback mechanism, 
need specified by a client of the DOW subsystem. Since the deferred operation is now completed, all depeo 

The starting node in the ordering graph that is requested dence links between its corresponding node and other nodes 
LiL^Tf^JL 0 ^ * C requestcd node ' ™ e °OW 15 in toe graph are dismaii^ 

S^ ^^T 5 CXC< : Utkm by searchm « me com P leted was last predecessor of some other graph 
ordering ; graph i <«■ sub-graph), staring from the requested node, and the latter node had been marked 'Wed* 
node arid identifying all the nodes that are predecessors of the latter graph node is now conTto^rSvfeS^ 

li the trivial case, if there are no node! wMdi num be executed?*^ ^ immediately after its list predecessor Is 
executed before the requested node, the DOW subsystem 

can initiate the operation that corresponds to the requested EXAMPLE 

Tie actual implementotion of the search for nodes to be 4 S orS J^lTm^ ^ T UStra ^ *»■ f* 00 *' "^8 the 
pruned, and thTaurying out of the »Sou*?££iZ m^. 8 '^TSl 4> N ° te ** * e ° rderin 8 graph of 
represent, is the 33 ^iSSS^ST Z^SS^mS* " 3 ^ * l 
s*a«cgy functfon and a dow_flush daemon, ^municating 2SS2iSl! S * 
^^^ s ^ M ^^_ aush ch J Daemon wS££23£ II 

For the DOW subsystem to be able to mediate in this FTO £ * " node C * dieted in 

manner, the file system implementations that use the DOW Insertion Into Dow Flush Chain-Prr. n 

facility must replace their calls to the device driver write n n «^ o ~h t -° au, ~ H0, 12 

strategy routines with calls to the dow suWr^uT ™de C . nd sc «ng that there is at least one 

buffer. If no such node exists, or if the node eiJsU but has £d ^ n« i£ h ' SmceA 14 91 of the graph, 

no predecessor nodes, then dow strati ^iL^^Jnb £ . d ^ end J ence "P° n oth « 8™Ph nodes, it is 

the buffer to the device driver ^^^fSS SI w£ VlT ( « "°- 12) ' from 

write strategy routine. ,« w i" be sdeeted b y *• dow_flush_daemon. 
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^H^ m A * C " d * mantle . d - Since C is marked pruned information tracked mdependently by the DOW subsystem 
and has oo other predecessors in the graph, it is moved to the creates the foUowing problem. ^system 

MoVTofcSu^bSn-FIG. 14 There are two conditions in which a call for asynchronous 

Subsequently, the dow_flush_daemon removes node C s J^i&^tt^ZFZ^Z?*" 

from the dow_flush chain, and executes it bv writine the ^ ^ ™« e * e - 0) *e page is akwdy being written to 

buffer to disk. Upon" eviction ofX w5™hi tl Vn^*? (S T * e "5"*™ write cannot 

between node C and nodes Band Fare removed. Since E and \ 0r ' C) * e me system Cementation finds that the 

F are function nodes, and ail nodes preceding them in the f ^^not ^'' .modified since the last time it was written 

ordering graph have been executed, ttey are moved to the ,„ ^ ( " SUaUy *" h . a PP e ° s ^ere is a race with some 

dow_flush_chain (FIG. 14). Thus, after the buffer contain- ^ of • Wnte * 8ame W). From DOWS 

ing the inodes for files 1 and 2 is written to disk, the TlrT v^'! ! m ? ertainty ' m casc »°<»* whether 

functions mat return their disk blocks to the free list can be ~H\ write that is in progress (or has just completed) 

safely invoked; this is done by the dow_flush_daemon as ff^V spe^fic inodification. or whether the modifica- 

it visits thedow_jlush_chain andproccsscs the nodes B and , < "T^J** (2) * about whethcr 

E r " me modification state in the DOW data structure is dear at 

Following this sequence of events, of the original graph ™ *? mo * ficatioa state « P>ge * dear, 

nodes, only nodes B and D, remain. At some foturetime . UBCC «f u " lcs would not arise at all, if DOW sub- 

these nodes will migrate to the dow_Jush_chai n either acccss «■« P«8 e structures (used by the 

because of explicit requests to perform their operations or mCm0ly . mana 8 emen t subsystem) under the cover of the 

perhaps due to expiry of the delay interval permitted'on a PP^P na | c ^^ r ysub S ystemlocks;butsuchaccesswould 

these operations compromise the modularity of the DOW and the memory 

Additional Design Considerations subsystems. 

The additional design issues in creating the DOW mecha- However . me UkeUhood of races leading to either of the 

nism divide into above uncertainties Js low. The DOW subsystem takes 

MODULARITY: Implementing the mechanism as a * f, ,° f ^Z^® 100 * of Mces - bv usln 8 ^ 
modular subsysteZwithin thf toel f^^gonthm. The dow^ush_daemon attaches a tag 

mORRECOVER^AJiowingafilesystemimplemen- LtpSg^VSKl £S ",22 2^ 

^^fromanerrordunngadelayedordered s^S£%^%«£^T£^ 

nBAmnf ,„ _ J 30 dow — flush — daemon can discover later that this has hap- 

DEADLOCK AVOIDANCE: Preventing deadlocks pened. In the infrequent cases when the dow flush 

resulungfromthe addition of ordering dependendes to daemon discovers thVt an attempt to write i paZsTnchro: 

resource dependendes in the kernd. nously failed, it hands the page over to ancZfceTnTat 

These design considerations are briefly described in the is spedally created for wridig pages synchronously, 

next three subsections. In these subsections, reference is 35 Error Recovery ™™>uusiy. 

frequently made to races between agents perfcrming com- For file system implementations which use delaved 

puta^ operations. This term, "race", is used broadly in ordered writ£ manyTf the wSH JSol a^ h2 

meOperatmgSystenuU^ asynchronously by the dow^ushJX uZZZ t 

o^Wd1Sw 0 f"?rr i0nalVariable ^ CaUteOf *"™<°f*«Werror,to^^ 

^^^c^csty^^tolKcomcinconsistal recovery action for the error. D^niaWsprovisionfoS 

tk- rJini u ■ , b y^owmganiesysteminmlementationtoremsterraitincs 

J£ ^ m ^ h,ttl£ra CTC f ed as a modular service to be called in the event ofan error. If an error occTsT 

wittun the operating system kernel, separate from the other DOW subsystem can call the registered mutincZL ft 

kernel subsystems with which it must coordinate to ensure « information regarding what opcS faU^ d why ^ 

deadlock free execution. As a modular service, DOW cannot Deadlock Avoidance y ' 

tr^ $ i°rT^ *?*r^ a U * mema 2' ™**&cmcnt One of the effects of deferred execution of disk writes is 

, k, ^l^" " nVm: D0W cannot - th » t ncw POssibiUties for process and system deadlocks 

Z ^' "** SUd> ^ 88 whethcrapagehas been arise. Listed below arc three broad deadlock siLtion,^ 

modified or is instantaneously being written to disk. To jo can arise with the DOW scheme. WulethSSScS 

c^umvent this problem, the DOW subsystem indepen- be addressed by certain standard txxhr^ues to 

^^w bltf ^ M / toU ^ tteStatU5 ° fdireCtorypl « es - avoidance (^ec, for example. A. clKT^ 125 

This DOW version of information can become stale in rare Design of Operating £Zs, » pp. 224-227 l4nUc*HaU 

«cum,tances. This poses a problem which is described 1574), the resultingMputatio^l ovaSd may degrade 

below, along with a sohmon that has been implemented. ss performance sigmncanUy^rlence, for each of deadS 

JLTST 00 ^ do r- ilush - d - emon is ^ » »<« ^"tions listed bdow. more details fo"ow dSrtlSS 

£15 «■ ° WD ^ reqUMtS 10 ,hc We sys,em s P ec ^ ^"tions that are simple and effldent and have 

implementation, as asynchronous writes. This requirement been used with the DOW implemenUon 

arises from the consideration that if thedow_flush_daauon VO Inter-dependendes' 

^?J t nl f !"J t . Wri,e ,0 J com P lete - ^ ^ Dode ~rre- so VO inter-dependendes created by DOW could rive rise to 

rS, d £ H° Pefa fl 0 1. *! P fedecessors « hat oeed to <=ydes. Furthermore, file system implementations frSS 

^ previously, then an impasse overhead, in an operation called "dustering". Such VO 

w^i ef o a C s^^.^T? 5 ' * W Ma * " eW > WddeD -^epe'ndenc e, 

writes to a separate daemon that can write pages 65 Resource Inter-dependencies: 

vSl 0 , S,f C H° St ^ SOmC rf P^ 0 ™""*- Perf °nni"8 I/O requires resources. A potential for dead- 
With uus as the background, staleness of page-modification lock exists when DOW subsystem attempts to acqture 
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Memory Exhaustion with Blocked Pageout Daemon: Resource iDter^ependendes^^ 

fr^L^^T 011 1 • ? SyStCm pr0C f SS mal to uscd to Witfa DOW, a write operation undertaken by some process 

a. *a ^ , „ ^ sraaras'ttrjasaas 

graph does not remain acyclic. This condition is prevented 10 illustrates this problem. ^ ^ 

by cycle checking in dow_order. To perform exact cycle ^ M M u M . - v , t , 

detection, that is, to identify exactly the situations in which „^f e £lw ^ Y loduA ' 40(1 " fiucs a 

a cycle arises, requires that the DOW subsystem either A " " , 

perform an exhaustive search with every addition of a link A ^f (<>rpage)ZniustprecedeYtodisk, soanattcmpt 

to the ordering graph, or, maintain a transitive closure (that 15 „ "uaated (which may or may not be in the context of 

is, node-to-node reachability information) over the full ' to acquirc me appropriate lock on Z. 

ordering graph. Both methods of detenmning exact connec- Process P2 holds Z and is waiting to acquire Y before it 

tivity are computationally expensive. can release Z. 
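The cost difference between an exact cycle test and a cheaper conservative test can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the DOW implementation: the class `OrderingGraph` and the method names `would_cycle` and `safe_without_search` are hypothetical. `would_cycle` is the exhaustive reachability search performed on every link addition; `safe_without_search` is a constant-time rule which only vouches for a link A->B when A has no predecessor or B has no dependent, in which case no path from B back to A can exist. Both methods assume A and B are distinct nodes.

```python
from collections import defaultdict


class OrderingGraph:
    """Toy ordering graph: a link a -> b means a must reach disk before b."""

    def __init__(self):
        self.dependents = defaultdict(set)    # node -> nodes that must follow it
        self.predecessors = defaultdict(set)  # node -> nodes that must precede it

    def add_link(self, a, b):
        self.dependents[a].add(b)
        self.predecessors[b].add(a)

    def would_cycle(self, a, b):
        """Exact test: a -> b closes a cycle iff a is reachable from b."""
        stack, seen = [b], set()
        while stack:
            node = stack.pop()
            if node == a:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.dependents[node])
        return False

    def safe_without_search(self, a, b):
        """Conservative test: True means a -> b certainly adds no cycle."""
        return not self.predecessors[a] or not self.dependents[b]


if __name__ == "__main__":
    g = OrderingGraph()
    g.add_link("X", "Y")
    g.add_link("Y", "Z")
    g.add_link("P", "Q")
    # Z -> X would close the chain X -> Y -> Z into a cycle.
    print(g.would_cycle("Z", "X"), g.safe_without_search("Z", "X"))
    # Y -> P joins two separate chains: no cycle, yet the cheap test declines.
    print(g.would_cycle("Y", "P"), g.safe_without_search("Y", "P"))
```

In this sketch the conservative test never approves a link that the exact test rejects, but it can decline a harmless link such as Y->P, in which case a caller would have to execute one of the two nodes before adding the link.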

The DOW implementation uses the following simple, but
fail-safe heuristic in lieu of performing exact cycle detection.
Adding a link A->B to the ordering graph is permitted
if either (1) A does not have a predecessor node, or (2) B
does not have a dependent node. Otherwise, the DOW
subsystem executes one of A or B before permitting the link
A->B. While the above heuristic successfully detects true
cycles and prevents their occurrence, it has the potential for
inferring cycles that do not exist and thereby cause some loss
of efficiency from node execution.

However, the heuristic is very successful in avoiding false
cycles because of a characteristic of DOW usage. The
construction of an ordering graph by a file system implementation
usually proceeds linearly, as in the example of
TABLE 5, with predecessor nodes created first and dependent
nodes added in sequence. The likelihood that two
pre-formed chains of dependence links ever fuse is very low;
so that as new nodes are added to an existing ordering chain,
they go to the end of the chain. In other applications of the
DOW mechanism, a more sophisticated heuristic or accurate
cycle detection may be more beneficial.

In addition to cycles, I/O interdependence could result
from I/O clustering that is frequently performed by file
system implementations. Clustering is a performance optimization
in which read or write operations on two or more
contiguous disk regions are combined before being presented
to the disk drivers. With DOW, clustering presents a
deadlock potential, since it may create undetectable I/O
dependency cycles. A simple example of such a cycle is
when two writes are combined as a single I/O but in which
one operation has a direct or indirect DOW dependency
linkage with respect to the other operation.

Resource dependence cycles are commonly prevented by
using resource hierarchies by which all subsystems perform
resource acquisitions. In hierarchy based allocation, each
resource category is assigned a hierarchy level, and resource
acquisition is permitted only in an increasing (or decreasing)
order of hierarchy levels. This ensures that two agents
attempting to acquire two resources do not each acquire one
of two resources and then wait indefinitely for the other
resource. Thus, a hierarchy allocation rule may have prohibited
the allocation of Z to a process that holds Y as a
simple means of preventing a resource interdependence
from arising.

It is difficult or impossible to apply hierarchy rules to the
operation of the dow_flush_daemon. Activities of the
dow_flush_daemon cause the dismantling of pre-existing
links in the ordering graph, so that new links can be added,
and so that dependent nodes are readied for execution. If a
hierarchy rule forces the dow_flush_daemon to wait for a
resource such as a specific buffer, and the process holding
that buffer waits until the dow_flush_daemon can succeed
in removing an ordering link in the graph, then an impasse
develops between the two processes. One solution is to have
the operating system maintain an explicit knowledge of
resource ownerships, and force a process to release a specific
resource (to break the deadlock), and fall back to a point
from which it can re-initiate its operation. Generally this
solution degrades performance noticeably.

With DOW usage, a less general but simple and effective
solution exists. In this solution a file system implementation
that is a DOW client is required to observe this rule:

A process is not allowed to hold a resource (e.g., a page
or buffer) which is represented in the ordering graph as

inter-dependence arising from clustering can be avoided toe subject of a write operation, while blocking for 

by a range of solutions. A general, exhaustive solution would another resource that is also similarly represented, 

be to cover an entire cluster unit with a special DOW graph ™ s solution is not restrictive because a directory page or 

node so that the individual dependence links for each a buffer containing structural information, needs to be held 

member of a cluster are brought together as dependence 55 locked by a process in two cases: 

Zffc'o^^ D< ^ gr , aph °,° <,e fM clUtter 0n « 0i,,6 to™ *c page (or 

unit. Another solution to modify clustering so that the VO buffer) to be locked. Device driveToncxations do not 

Sr^~ mb ^ b ^ $Uedto ^ ck$uree " to * ^"^telylongtoetoco^ktefaXnotne^ 

scU perfonn streamlined accuses to disk. «, resource is hdd at a time. 

disk blocks ma, are represented oSj g^J as T^S^JS^ other P'°<*«<*- 
There are a small number of instances in the UFS file
system implementation that are modified for enforcing the
above rule of holding one resource at a time. In all these
instances, the file system implementation was previously
holding multiple buffers or pages in the course of applying
updates. The previous policy of simultaneously holding
more than one page or buffer in this way was arbitrary and
hence was easily corrected.

Memory Exhaustion with Blocked Pageout Daemon

The pageout daemon handles lists of free but modified
pages; it has to write them to disk before they can be
available for fresh use. In order to submit a page to a disk
driver, the pageout daemon uses a memory unit called
"buffer header" to record some auxiliary information about
the page, and passes the address of the buffer header to
the disk driver. Normally, the memory needed for this
purpose is obtained from a dynamic kernel memory allocator
(KMA), to which it is returned as writes complete.

To permit graceful system recovery under extremely
severe memory conditions, the kernel maintains a small,
"contingency" memory pool for critical uses. If KMA cannot
furnish the needed memory to the pageout daemon, then
the kernel provides the needed memory from this contingency
pool, and as the page writes complete, returns the
memory to the contingency pool. When the DOW mechanism
is used, the dow_flush_daemon is provided memory
from the contingency pool as well, for the asynchronous
page writes that it needs to perform, if KMA cannot furnish
the needed memory.

different sequence of these commands, all scripts are composed
of the same aggregate command mix, so that the total
work done by each of the simulated users is identical. In the
absence of DOW usage, the performance of the benchmarked
system is found to be limited by the speed with
which read/write I/O is performed to disk.

Normalized System Throughput - FIG. 15

FIG. 15 compares the normalized system throughput with
and without DOW usage. DOW usage improves the SPEC
SDET measure of system's peak performance by more than
a factor of two. This happens as the benchmarked system
moves from being disk bound in the base case to becoming
bound by the processor execution speed with DOW usage.

Disk Write Operations Reduction - FIGS. 16 and 17

The reduction in disk write operations is confirmed by
FIG. 16 which shows that the number of disk write operations
is reduced by 85% relative to the number without the use of DOW.
An added benefit from the reduction of synchronous writes
is a 50% drop in the aggregate number of process context
switches, as shown in FIG. 17.

Decrease In Disk Service Time - FIG. 18

An additional indirect benefit of using delayed ordered
writes is a decrease in the average disk service time as
shown in FIG. 18. The decrease in disk service time occurs
because the dow_flush_daemon can issue multiple concurrent
disk requests (since its disk operations are not
synchronous), and thus enable the disk driver to better
schedule these disk writes for best disk access performance.

Network File System Environment - FIG. 19

As the number of deferred page writes increases, the 30 The<x>nautermtem3ofFlnn?¥al5v e r li v t t Bm «h,t \i J ' 
™^ ^.t ^ * m% ^ c onnects tea client system in the form of 

gency pool to cover them in the event of a severe memory T he computer syste m 3 of FIG. 19 lik eT Ftff i t 
StZ^L S? ge0Ut ? cannot continue to operate in composed ol hardware & and software t. Tt eTarHw^ 5 

™TZJ£ D a ^* d " d j°<* « «! sult i ncludes one or more processors 10. tv ScaTim^.,. 

This deadlock is prevented by detecting that the count of 35 processing unint!l>liv")r f a {n I I inputotPUtlTO^ 

t^EZllin 2f dtpCn n enCe V? 00 " y deWCCS • • ^-N ^Xare^nclu a^n;^!; 
S Sy,^^?.K « "T^' U0U1 ^ C0lmt system 14 and user (application) programs 15. The compuTer 
SaS ^ meaBtlme '. tne s >; stem * <° system 3 executes user programs 15 inme hardware sTdcr 

revm to using synchronous writes instead of creating new <o control of the operating system 14. A common instanceof 
ordering dependences. This operation is done transparently the ooeratino «v«t,.,n 14 u ( tf , in v rr™ ooexahnB svstem. 

£?, ^^"^^ntation. hardware 5' and software «\ The hardware inclu^Tone or 

Performance Gain Prom Delayed Ordered Writes more processors Iff. typically a central processing unit 

use of delayed ordered writes in UFS shows that dramatic input/output devices 13 -1, . . 13'-N. The software € 
SSSwt^ by ,ynCbJOnOUS diik includes « operating system 14' and u Sff (Jp. cS") % o 

V H^I^ c" emade 0D " IS'. The computer system 3' executesTser prognL 

TTllf I££ mi. * Sy! f n -™ tmU,8 ° n " 15 ' m mc hardware S ' mda c° ntrol «* operating sfstem 

ZZt £ ^ P ^ SSOr ^ * C ^ SUb " 50 14 A common '»*««« of the operating syrtem 14' I the 

system. The benchmark used for measuring system perfor- UNIX® operating system 

13? ri? C Perf0n T" EvsIuation C «T*™- ^T^WS^rO-l and 13-1 in the server system 3 and 

X^SSrSL D ^^\^^ Throughput / the client system 3'. respectively, connect the systems 3 and 
(SPEC SDBT) bendmwk (see System Performance Bvalu- / 3' together through a network 50 to form a Network FUe 
aaonC«^ h ve.057.SDBTBendmwk:Auser'sG U ide). L System (NFS). The network 50 is any Conventional 
The comparison is drawn by normalizing the measurement V^ea network. 

^ ^t.7 SUltS ° btaiDCd Wial0U, thc USC rf dcUyed AB^ca^brTnTus; of delayed ordered writes in place 
Thr<:r£rQiwrv_ k u- , .- °f ^nchronous writes to Improve response time and 

me J»rec benchmark simulates a mulu-user work- throughput in the Network File System is described. 

Z?. !!* f 08 ™? de r el °P me » t environment and measures 60 The Network File System (NFS) permits sharing of files 
toughputas a function of the number of simulated users. between different computer systenTconnecteTvL a net 

of a UNIX shell script that is composed of a randomized . system, such as NFS client system 3\ to access fileson a 

n^< t r7^ A ^T?y aiptCO ' lSiaSO t < usu ^)^cn«c*m P utersyLrn.suchass^sysem3 
commands to edit Mes, compile and Unk programs, create, 65 On an NFS server system 3. several processes caUed server 
remove and copy directories and flies, and text formatting processes execute L order to 
and speU^hecking documents. While each script executes a access services to client system ? ^ 
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back us toe network 2 ^ D OW node for ftnction caU to ^a, the server 

Include data or status Information. , fx,"' . 

Before it sends a response back to the client the server B ™ 0 t ^ n to - memor y of todirect block, if necessary. 
Process must ensure that all data modWcafonY SsocSed iSSS, fTF 0f block »as been 

with the request have been written totofc Fo exa^f 2 ,« ? ^ aWl f h ™ specifying that the fimc- 

a file writelquest, toe servers ^SS dl S iS£ V ^ is to be «« 

to disk along with any structural nwdffic^suchas^ containing the indirect block. C If the 

inode and indirect blocks, wait for those write? to comnil ^ bl °* ^ 1x56,1 modified ' tosue or asyn- 

Server processes typically use synchronous writes tomeet ,« u, -. mcm °2 C0 W ° f B. Establish an 

this requirement For example, to S a wrELueTJ ''"^f^ ^ ftC ftaB0,to " wbich <M *e 

server process uses synchronous wrta to wriE AtaSfa £T pr °? u 5S be called after writing the buffer 

and,if oect^JLuodS^ a SZf* """T^ ^ dak C. Issue 

waits for each synchros J£tT2 o^^TST" ^ 10 ** ° f to ^ 

complete before proceeding, this use of synchronous writes » 5 A. uXZ ^ . , 

guarantees that the data will be on disk befi^Z™™,.!! j£, ^ ^i™ cmoiy «»ode. B. Establish an 

message is sent * * e rc4p00,e ordenn 8 specifying that the function which signals the 

TVpically, a server process takes these steps, in time order ZIK" . u to be called after writing the buffer 

to service a client request for writing a filT ' COpy of * c C Issue a 

1. Acquire lock protecting file data structures M or synchronous write to disk of the buffer 

to disk, if modified. g mwea otocic 7. Wan for s lg nal (coming from function) that the disk writes 

3 w^°b^ 8.^^^ back to user. 

4. A. Modify in-memory copy of inode. B. Synchronously sch^e^T^r^^^^^ ^Fevious 
write biiffer containing inode to disk. ,>y,,cnronous, > r ««*. "if file lock is only held during the time that 

5. Release lock protecting file data structures £?J m f ^ ls aoHSed, «<» the DOW graph is 

6. Send response mcssagf back to uS f^"" 1 '' 11 15 ^ ual before the diskwrites 
One major drawback of this scheme is that if multim, « ^ ^ 5 J" 6 "™ ^ lockfe heW for less time on 

requests arrive for the s^ t^ Z^Z ^Zl^ a ^ bn ^^ K ^ 1 ^ 

dling these requests will each need to £HSt *same £k ^^8^ T-T™ Pr0CMS 

in step 1, above). One process at a timVwould acquire toe mWo^Z ^ SMVld,,g te "-J"" 1 

lock, perform steps 2, 3, and 4. and then release the lock for Secondly to thi« crh™. - 

another process to acquire. The time for which the lock is « the *^'v.^ re< ' UMt k hand ed to 

held includes the duratfon of time toat each of toeVnAro- J«™ *f ^1°* ^ preVl ° US 0ne ""P 1 ""- ™* 

tmie.Theoiskdriveristousforc«dtoschediU e ihr^;« in «, , , s ( DOW )- W W* is used by a file system 
^^to^t^uc^^S^^^ 50 ^^^toscheduJe^vvritmgofasetofdauL^ 
order would be more efficient Ate Tutue Z no Z£Z of * "T" 4 ^ Stolage 80 ** 11,6 ^8 of 

work happens between the disk drivS SZ Server Sd^Tl^ fa * Spedfic ' <ksirecl order - Use of 
processes since the disk driver staysEu^g S KLLTm^ l ° Cmp,0y s y nd »»"o«s disk 

between the completion of one write and the inWation of toe ss TVntr ^ ° r ^ erin f "»»« writes, 
next, and since toe server process needs to waft to Lch ^f nlfl ?, n , P ^T^ bMefltt of usin 8 ^ faeUi, y are sig- 
the disk writes to complete. ° f writes axe reduced by nearly an order of 

An alternative scheme would be to use delayed ordered «f * typl< ? 1 . lJNK ope^ system workloads 

writes in place of synchronous writes, to toi SemeT S«Z ^lu^^f Wnte cachifl 8 *« " 
server process could instead issue delayed ordere^^ «, f« ,1^ ' ^ *? ye ?.!?* l » * conle « switching 

and specify an ordering relationship betw^hlS * I'tZ^^* ^™*^*™^* 
a deferred function call which would signal the server TheDOWu^L.nH^ 

process when sll the writes have completed. Instead of the *ZT« ^^ a ,f ^f/rf°n^ee gains were measured in 
holding the lock across all of its operations Twi«™™L« . • WS 616 ty 5tem implementation in a 

m-memory copies of the information uut iTmSj and SowlLr £ ,» h tt W^ Operating System). 

*e time spent waiting by a server process for ^ ^SZT^A^T^ 
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Snrft^Tl ^ USCd * f * 1 eMB * 1 * to toe 3. The ordering subsystem of daim 2 wherein said delete 

benefits already achieved in some file system implement*- means includes operation delete means for ^EteToS? 

^£ttj^£'2Z5£2X > ^ — ^^ord^es.omsaidorde^ 

system implement^ons While it is available as a mecha- 4. The ordering subsystem of claim 3 wherein said delete 

Use of delayed ordered writes ia some applications can 8. The orderine subsystem of claim i m m « * 

=S5SSS25?r * ^^-5SSS5=5 

i An ™-A^ nn .. u , „. 23 said operatioas include writes from said primary storaoe to 

L An ordenng subsystem for controlling the order of said secondary storage and may include 

flies In response to new requests where a sequence of ui>datereou«t« »i »i , y "r* 01 cuun * wnerein said 

requests for said operations fa represented by me Quests ^IZZ ^^Z ^l^r^^ 

R1 *f ^ *e request for said options XSn JSSSJSrf "-ft-SS^ 

in said sequence have order dependencies Dl, D2 Dd delaved «, »< m «io upoaie requests Is 

S^EST ° Ut ^ - -ce.sarily in the time order ox the 

an ordering store for storing a plurality of entries, each of
said entries containing an operation type identifying
one of said operations for files, at least one of said
entries at some time also containing a link which links
said entry to another of said entries, said link specifying
an order for carrying out said operations in said linked
entries, said entries and said links defining a partially
ordered acyclic graph,

add means for adding entries to the ordering store by
processing said new requests to identify one or more
common operations CO0, CO1, . . . , COco, each of said
common operations identifying an operation requested
by one or more of the requests R1, R2, . . . , Rr, where
said common operations have common order dependencies
CD0, CD1, . . . , CDcd that preserve the order
dependencies D1, D2, . . . , Dd between the operations
in the requests, and where co and cd are integers,

execution means for executing said one or more common
operations CO0, CO1, . . . , COco responsive to the
entries in the ordering store, and

delete means for deleting entries from the ordering store.

2. The ordering subsystem of claim 1 wherein said add
means includes,

dow_create means for providing one or more of said
entries as operation entries for identifying common
operations, and

dow_order means for providing one or more of said
entries as order entries for identifying said common
order dependencies.

12. The ordering subsystem of claim 9 wherein said add
means operates asynchronously with respect to said execution
means.

13. The ordering subsystem of claim 9 wherein said delete
means operates asynchronously with respect to said execution
means.

14. The ordering subsystem of claim 1 wherein said
system includes a local unit and a remote unit connected by
a network.

15. The ordering subsystem of claim 14 wherein said first
unit is primary storage in said local unit and said second unit
is secondary storage in said local unit, wherein said requests
are update requests and said operations include writes from
said primary storage to said secondary storage and may
include function calls, and wherein said remote unit initiates
said requests for writes from said primary storage to said
secondary storage over said network.

16. An ordered write subsystem for controlling the order
of operations in connection with writes from primary storage
to secondary storage in a computer system, the computer
system having data organized in files, having primary storage
for storing files, a secondary storage for storing
files, having a file management subsystem for controlling
transfers of files between primary storage and secondary
storage, said file management subsystem specifying operations
in connection with writes from primary storage to
secondary storage in response to new update requests
where a sequence of update requests for said operations is
represented by the update requests R1, R2, . . . , Rr,
where the update requests in said sequence have order
dependencies D1, D2, . . . , Dd and where r and d are
integers, said order dependencies constraining the order for 
carrying out said operations, said ordered write subsystem 
including, 

an ordering store for storing a plurality of entries, each of 
said entries containing an operation type identifying 
one of said operations, at least one of said entries at 
some time also containing a link which links said entry 
to another of said entries, said link specifying an order 
for carrying out said operations in said linked entries, 
said entries and said links defining a partially ordered 
acyclic graph, 

add means for adding entries to said ordering store by 
processing said new update requests to identify com- 
mon operations, said common operations including, 
one or more common writes CW1, CW2, . . . , CWcw 
for a combined operation requested by one or more 
of the update requests R1, R2, . . . , Rr where cw is
an integer less than r, and one or more function calls
FC1, FC2, . . . , FCfc where fc is an integer, and
wherein said common writes and said function calls
have common order dependencies CD1, CD2, . . . ,
CDcd that preserve the update order dependencies
D1, D2, . . . , Dd between the operations, where cd
is an integer, 

execution means for executing common operations 
including, 

write means responsive to the entries in the ordering 
store for writing from primary storage to secondary 
storage with said common writes CW1, CW2, . . . ,
CWcw constrained by the common-write order
dependencies CD1, CD2, . . . , CDcd,
function means for executing said function calls, and 
delete means for deleting entries from the ordering store. 
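
As an informal illustration of the structure recited in claim 16 (it is not part of the claim), the following minimal sketch models an ordering store whose entries are either common writes or function calls, with order links between them. All names (`Entry`, `OrderingStore`, `create_write`, `create_function`, `order`, `execute_all`) are hypothetical, and the list `log` merely stands in for secondary storage. The sketch assumes that repeated update requests for the same buffer are folded into one common-write entry, that an entry is executed only after every entry linked ahead of it, and that entries and their links are deleted once executed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Entry:
    kind: str                                      # "write" or "function"
    target: Optional[str] = None                   # buffer name for a write entry
    func: Optional[Callable[[], None]] = None      # callback for a function entry
    before: Set[int] = field(default_factory=set)  # ids that must execute first


class OrderingStore:
    def __init__(self):
        self.entries: Dict[int, Entry] = {}
        self.by_buffer: Dict[str, int] = {}
        self.next_id = 0
        self.log: List[str] = []                   # stands in for the disk

    def _new(self, entry):
        eid = self.next_id
        self.next_id += 1
        self.entries[eid] = entry
        return eid

    def create_write(self, buffer_name):
        # Several update requests for one buffer share one common-write entry.
        if buffer_name not in self.by_buffer:
            self.by_buffer[buffer_name] = self._new(Entry("write", target=buffer_name))
        return self.by_buffer[buffer_name]

    def create_function(self, func):
        return self._new(Entry("function", func=func))

    def order(self, first, later):
        # Record that `first` must be carried out before `later`.
        if first != later:
            self.entries[later].before.add(first)

    def execute_all(self):
        while self.entries:
            ready = [i for i, e in self.entries.items() if not e.before]
            if not ready:
                raise RuntimeError("ordering graph is not acyclic")
            for eid in ready:
                entry = self.entries.pop(eid)      # delete the executed entry
                if entry.kind == "write":
                    self.log.append("write " + entry.target)
                    del self.by_buffer[entry.target]
                else:
                    entry.func()
                for other in self.entries.values():
                    other.before.discard(eid)      # delete order links to it


if __name__ == "__main__":
    store = OrderingStore()
    first = store.create_write("inode")
    again = store.create_write("inode")            # folded into the same entry
    directory = store.create_write("dir")
    store.order(first, directory)
    done = store.create_function(lambda: store.log.append("notify"))
    store.order(directory, done)
    store.execute_all()
    print(first == again, store.log)
```

In this sketch two updates of the inode buffer produce a single write, the directory write follows it, and the function call runs last, which mirrors the idea of a delayed write being absorbed into a common write while order dependencies are preserved.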

17. The ordering subsystem of claim 16 wherein said add 
means includes, 

dow_create means for providing one or more of said 
entries as operation entries for identifying said common 
operations, and dow_order means for providing one or
more of said entries as order entries for identifying said 
common order dependencies. 

18. The ordering subsystem of claim 17 wherein said
dow_create means provides said operation entries as
common-write entries for identifying common writes.

19. The ordering subsystem of claim 17 wherein said
dow_create means provides said operation entries as
function-call entries for identifying function calls.

20. The ordering subsystem of claim 16 wherein said 
ordering store includes ordered locations with higher-order 
locations and lower-order locations and for each common- 
write entry, zero or more higher-order locations for an order 
entry pointing to zero or more higher-order common writes
and zero or more lower-order locations for an order entry 
pointing to zero or more lower-order common writes. 

21. The ordering subsystem of claim 16 wherein a write 
for one or more of said update requests is delayed so as to 
be part of one of said common writes. 

22. The ordering subsystem of claim 16 wherein said
update requests R1, R2, . . . , Rr are presented in time order
with R1 before R2, R2 before R3, . . . , R(r-1) before
Rr and wherein a write for one or more of said update
requests is delayed so as to be part of one of said common
writes whereby writes for one or more of said update
requests are not in the time order of the update requests R1,
R2, . . . , R(r-1), Rr.

23. The ordering subsystem of claim 16 wherein said add 
means operates asynchronously with respect to said execu- 
tion means. 


24. The ordering subsystem of claim 16 wherein said
delete means includes operation delete means for deleting
operation entries from said ordering store and includes order
delete means for deleting order entries from said ordering
store.

25. The ordering subsystem of claim 24 wherein said
operation delete means includes dow_abort means for deleting
entries from said ordering store.

26. The ordering subsystem of claim 25 wherein said
dow_abort means operates asynchronously with respect to
said execution means.

27. The ordering subsystem of claim 16 wherein said 
primary storage includes a cache and wherein said ordered 
write subsystem causes said file management subsystem to 
initiate writes from said cache. 

28. The ordering subsystem of claim 26 further including 
device drivers connected to write from said cache to said 
secondary storage. 

29. A method in a computer system having a first unit and
second unit for files, having a file management subsystem
for controlling operations for files, said file management
subsystem specifying operations for files in response to new
requests where a sequence of requests for the operations is
represented by the requests R1, R2, . . . , Rr and where the
requests for the operations in said sequence have order
dependencies D1, D2, . . . , Dd where r and d are integers,

said order dependencies constraining the order for carrying
out the operations, said computer system including an ordering
subsystem for controlling the order of operations
including, said method comprising:

storing a plurality of entries in an ordering store, each of
said entries containing an operation type identifying 
one of said operations for files, at least one of said 
entries at some time also containing a link which links 
said entry to another of said entries, said link specifying 
an order for carrying out said operations in said linked 
entries, said entries and said links defining a partially 
ordered acyclic graph, 
adding entries to the ordering store by processing said
new requests to identify one or more common operations
CO1, CO2, . . . , COco, each of the common
operations identifying an operation requested by one or
more of the requests R1, R2, . . . , Rr, where said
common operations have common order dependencies
CD1, CD2, . . . , CDcd that preserve the order dependencies
D1, D2, . . . , Dd between the operations in the
requests, and where co and cd are integers, and
executing said one or more common operations CO1,
CO2, . . . , COco responsive to the entries in the
ordering store.

30. The computer method of claim 29 wherein said adding 
step includes, 

dow_create step for providing one or more of said entries
as operation entries for identifying common operations
and

dow_order step for providing one or more of said entries
as order entries for identifying said common order
dependencies.

31. The computer method of claim 30 further comprising 
a delete step which includes an operation delete step for 
deleting operation entries from said ordering store and 
includes an order delete step for deleting order entries from 
said ordering store. 

32. The computer method of claim 31 wherein said delete 
step includes a dow_abort step for deleting entries from said 
ordering store. 
33. The computer method of claim 32 wherein said
dow_abort step operates asynchronously with respect to
said execution step.

34. The computer method of claim 29 wherein said
operations include writes from said first unit to said second
unit.

35. The computer method of claim 29 wherein said
operations include function calls.

36. The computer method of claim 29 wherein said first
unit is primary storage and said second unit is secondary
storage and wherein said requests are update requests and
said operations include writes from said primary storage to
said secondary storage.

37. The computer method of claim 29 wherein said first
unit is primary storage and said second unit is secondary
storage and wherein said requests are update requests and
said operations include writes from said primary storage to
said secondary storage and may include function calls.

38. The computer method of claim 37 wherein a write for
one or more of said update requests is delayed so as to be
part of one of said common writes.

39. The computer method of claim 37 wherein said update
requests R1, R2, . . . , Rr are presented in time order with R1
before R2, R2 before R3, . . . , R(r-1) before Rr and wherein
a write for one or more of said update requests is delayed so
as to be part of one of said common writes whereby writes
for one or more of said update requests are not necessarily
in the time order of the update requests R1, R2, . . . , Rr.

40. The computer method of claim 37 wherein said add
step operates asynchronously with respect to said execution
step.

41. The computer method of claim 37 further comprising
a delete step which operates asynchronously with respect to
said execution step.

42. The computer method of claim 29 wherein said
system includes a local unit and a remote unit connected by
a network.

43. The computer method of claim 42 wherein said first
unit is primary storage in said local unit and said second unit
is secondary storage in said local unit, wherein said requests
are update requests and said operations include writes from
said primary storage to said secondary storage and may
include function calls, wherein said remote unit communicates
over said network with said local unit to move data
between said first unit and said second unit.

44. A computer method in a computer system having data
organized in files, having primary storage for storing files,
having a secondary storage for storing files, having a file
management subsystem for controlling transfers of files
between primary storage and secondary storage, said file
management subsystem specifying operations in connection
with writes from primary storage to secondary storage in
response to new update requests where a sequence of update
requests for the operations is represented by the update
requests R1, R2, . . . , R(r-1), Rr, where the update requests
in said sequence have order dependencies D1, D2, . . . , Dd
and where r and d are integers, the order dependencies
constraining the order for carrying out the operations, said
computer method including an ordered write subsystem for
controlling the order of operations in connection with writes
from primary storage to secondary storage, said method
including,

storing a plurality of entries in an ordering store, each of
said entries containing an operation type identifying
one of said operations for files, at least one of said
entries at some time also containing a link which links
said entry to another of said entries, said link specifying
an order for carrying out said operations in said linked
entries, said entries and said links defining a partially
ordered acyclic graph,

add steps for adding entries to said ordering store by
processing said new update requests to identify common
operations, said common operations including,

one or more common writes CW1, CW2, . . . , CWcw
for a combined operation identifying an operation
requested by one or more of the update requests R1,
R2, . . . , Rr where cw is an integer less than r, and
one or more function calls FC1, FC2, . . . , FCfc
where fc is an integer, and where said common
writes and said function calls have common order
dependencies CD1, CD2, . . . , CDcd that preserve
the update order dependencies D1, D2, . . . , Dd
between the operations in the requests, where cd is
an integer,

executing common operations including,

write steps responsive to the entries in the ordering
store for writing from primary storage to secondary
storage with said common writes CW1, CW2, . . . ,
CWcw constrained by the common-write order
dependencies CD1, CD2, . . . , CDcd,

function steps for executing said function calls, and

deleting entries from the ordering store.

45. The computer method of claim 44 wherein said add
steps include,

dow_create steps for providing one or more of said
entries as operation entries for identifying said common
operations, and

dow_order steps for providing one or more of said entries
as order entries for identifying said common order
dependencies.

46. The computer method of claim 44 wherein said
dow_create steps provide said operation entries as
common-write entries for identifying common writes.

47. The computer method of claim 44 wherein said
dow_create steps provide said operation entries as
function-call entries for identifying function calls.

48. The computer method of claim 44 wherein said
ordering store includes ordered locations with higher-order
locations and lower-order locations and wherein said ordering
store includes for each common-write entry, zero or
more higher-order locations for an order entry pointing to
zero or more higher-order common writes and zero or more
lower-order locations for an order entry pointing to zero or
more lower-order common writes.

49. The computer method of claim 44 wherein a write for
one or more of said update requests is delayed so as to be
part of one of said common writes.

50. The computer method of claim 44 wherein said update
requests R1, R2, . . . , Rr are presented in time order with R1
before R2, R2 before R3, . . . , R(r-1) before Rr and wherein
a write for one or more of said update requests is delayed so
as to be part of one of said common writes whereby writes
for one or more of said update requests are not in the time
order of the update requests R1, R2, . . . , R(r-1), Rr.

51. The computer method of claim 44 wherein said add
steps operate asynchronously with respect to said execution
steps.

52. The computer method of claim 44 wherein said
deleting step includes an operation delete step for deleting
operation entries from said ordering store and includes an
order delete step for deleting order entries from said ordering
store.

* * * * *


01/28/2003, EAST Version: 1.03.0002 


